Semi-Supervised Web Wrapper Repair via Recursive Tree Matching

Authors

  • Joseph Paul Cohen
  • Wei Ding
  • Abraham Bagherjeiran
Abstract

Continuous data extraction pipelines using wrappers have become common and integral parts of businesses dealing with stock, flight, or product information. Extracting data from websites that use HTML templates is difficult because available wrapper methods are not designed to deal with websites that change over time (through the inclusion or removal of HTML elements). We are the first to perform large-scale empirical analyses of the causes of shift and propose the concept of domain entropy to quantify it. We draw from this analysis to propose a new semi-supervised search approach called XTPath. XTPath combines the existing XPath with carefully designed annotation extraction and informed search strategies. XTPath is the first method to store contextual node information from the training DOM and utilize it in a supervised manner. We utilize this data with our proposed recursive tree matching method, which locates the nodes most similar in context. The search is based on a heuristic function that takes into account the similarity of a tree to the structure that was present in the training data. We systematically evaluate XTPath using 117,422 pages from 75 diverse websites in 8 vertical markets that cover vastly different topics. Our XTPath method consistently outperforms XPath and a current commercial system in terms of successful extractions in a black-box test. We are the first supervised wrapper extraction method to make our code and datasets available (online here: http://kdl.cs.umb.edu/w/datasets/).
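The general idea of repairing a broken XPath by matching stored context can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' XTPath implementation: the similarity score, the greedy child matching, the function names `tree_similarity` and `repair`, and the two sample pages are all hypothetical, and the pages are well-formed markup so that the standard library parser can read them.

```python
import xml.etree.ElementTree as ET

def tree_similarity(a, b):
    """Recursively score how alike two subtrees are, in [0, 1].

    Assumed heuristic (not from the paper): tags must match, and each
    child of `a` is greedily paired with its most similar child of `b`.
    """
    if a.tag != b.tag:
        return 0.0
    kids_a, kids_b = list(a), list(b)
    total = sum(
        max((tree_similarity(ca, cb) for cb in kids_b), default=0.0)
        for ca in kids_a
    )
    return (1.0 + total) / (1.0 + max(len(kids_a), len(kids_b)))

def repair(root, context, target_tag):
    """Return the node with `target_tag` whose parent best matches `context`."""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    best, best_score = None, 0.0
    for node in root.iter(target_tag):
        parent = parent_of.get(node)
        if parent is None:
            continue
        score = tree_similarity(context, parent)
        if score > best_score:
            best, best_score = node, score
    return best, best_score

# Hypothetical training page and the XPath learned from it.
XPATH = "./body/div/div/b"
train_root = ET.fromstring(
    "<html><body><div><h1>Title</h1>"
    "<div><span>Price</span><b>10</b></div></div></body></html>"
)
# Contextual information stored at training time: the target's parent subtree.
context = train_root.find("./body/div/div")

# The site later adds an ad block and a wrapper div, so the XPath breaks.
changed_root = ET.fromstring(
    "<html><body><div><p><b>Ad</b></p><div><h1>Title</h1>"
    "<div><span>Price</span><b>12</b></div></div></div></body></html>"
)

node = changed_root.find(XPATH)
score = None
if node is None:
    # Fall back to searching for the node whose context looks most familiar.
    node, score = repair(changed_root, context, "b")
print(node.text, score)
```

In this toy case the original XPath returns nothing on the changed page, and the fallback picks the `b` element whose parent has the same `span` and `b` children as the stored context, ignoring the `b` inside the ad block.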


Similar resources

A Survey of Unsupervised Techniques for Web Data Extraction

The World Wide Web contains a large amount of data, and fetching important information from the web has become a useful task. Many web information extraction systems have been developed and categorised into manual, supervised, semi-supervised and unsupervised techniques. We will study unsupervised techniques and how they differ from each other. Roadrunner uses a match algorithm for generating the wrapper...


Semi-Automatic Wrapper Generation for Commercial Web Sources

Semi-automatic wrapper generation tools aim to ease the task of building structured views over semi-structured web sources. But the wrapper generation techniques presented to date are unable to properly deal with sources requiring complex navigational sequences for accessing data. In this paper, we present Wargo, a semi-automatic wrapper generation tool, which has been used by non-programmer...


Automatic Wrapper Generation and Maintenance

This paper investigates automatic wrapper generation and maintenance for forum, blog and news web sites. Web pages are increasingly dynamically generated using a common template populated with data from databases. This paper proposes a novel method that uses tree alignment and transfer learning to generate the wrapper from this kind of web page. The tree alignment algorithm is adopted...


Dual Teaching: A Practical Semi-supervised Wrapper Method

Semi-supervised wrapper methods are concerned with building effective supervised classifiers from partially labeled data. Though previous works have succeeded in some fields, it is still difficult to apply semi-supervised wrapper methods to practice because the assumptions those methods rely on tend to be unrealistic in practice. For practical use, this paper proposes a novel semi-supervised wr...


Automatic Wrapper Adaptation by Tree Edit Distance Matching

Information distributed through the Web keeps growing faster day by day, and for this reason, several techniques for extracting Web data have been suggested in recent years. Often, extraction tasks are performed through so-called wrappers, procedures extracting information from Web pages, e.g. implementing logic-based techniques. Many fields of application today require a strong degree of rob...



Journal title:
  • CoRR

Volume abs/1505.01303

Publication date 2015